Infosec Jupyterthon '24: Threat Hunting in Three Dimensions¶

Abstract: Threat hunting often demands capabilities beyond the scope of SIEM platforms. This presentation showcases a threat hunting workflow that leverages Jupyter for rapid, iterative, and visual analysis of complex data. By tapping into humans' innate understanding of three dimensions, we will demonstrate how to calculate and re-calculate metrics and distances between data points. Specifically, we focus on comparing attributes of Google Chrome Extensions for similarity in Euclidean space, allowing interactive exploration of data and a deeper understanding of relationships between data points. This approach helps uncover instances of masquerading within the extensions.

Dr. Ryan Fetterman (rfetterman@splunk.com / X: @iknowuhack)¶

Background¶

🧪 SURGe Security Research Team @ Splunk ¶

⛰️ PEAK Threat Hunting Framework ¶

🤖 Model-Assisted Threat Hunting (M-ATH) ¶

🔬 Data Science and Deep Learning with Splunk ¶

Today's Workflow¶

A means for exploring multivariate data with at least 3 quantitative measures.

Pre-processing:¶

    1. Ingest data into Splunk
    2. Filter / Normalize / Enrich
    3. Push / Pull into Jupyter via DSDL

Exploration:¶

    1. Scale and Orient Feature Vectors
    2. Cluster via K-Means
    3. 3-D Scatter-plot

Analysis:¶

    1. Measure Euclidean Distance
[Figure: Euclidean distance (source: https://hlab.stanford.edu/brian/euclidean_distance_in.html)]

This data exploration process is largely problem-agnostic, but we will make sense out of it through an example...¶

Case Study: Chrome Web Store Browser Extensions¶

Security & Analysis Challenges:

  • Diverse functionality:
    • Password managers, Adblockers, Translation, Coupon Trackers...
  • Diverse composition:
    • JavaScript, HTML, CSS, JSON, Web APIs...

In a sea of 140,000+ browser extensions, how can we find the imposters?

This problem is hard to solve at scale, but a good threat hunting workflow can make it approachable.

Baseline Data¶

Here is our baseline data (crx, name, description) + extension icons

crx | name | description
pfmhnjhlejjncbbmkopeeinhiolpccon | VLVical | Export für das VLV der TU-Ilmenau
pekgkbpcpmjdbkdiinpfojfgmfabieej | WP-Stars New Tab | Displays a customisable nicely designed New Tab Page.
pabeminldebomngnkgffiejipjjaaogi | GoogleGPT - ChatGPT on google | ChatGPT on all google searches.
ofldebdjlgdgokeokgacgoekofgokioe | Tweet This | Comparte la URL en Twitter.
nkenhionmhdegjkgghhigaifcmpioeff | Youtube Scripter | AI-powered transcription, summarization, translation, and script export with YouTubeScripter.
ndnefdpoldbalhfejpafdiajlciblpoa | What's it worth? (The Original) | What's your stuff worth on eBay? Find out!
mjmkcadjgnpdfpeodlincmeoedhihdmg | Short Links Search | Searches for various links
medllgheccmbihkbmplflablnlkamacf | Sibling ASINs | Display sibling ASINs on Amazon.com
lonmkndmggfaifodhdppcijhcbbfppie | Flexi Video | Browse and watch videos about books, authors and publishers. Watch the most popular book reviews, video trailer and news.
lgpfmglfagconknpjlninmhnmncncgdb | Free Trial Extension! | Free trial extension for test.

🎭 A masquerade will look like, sound like, or be described like our target extension...

Model-Assisted Threat Hunting (M-ATH) for Masquerading Extensions¶

We will enrich our baseline data with quantitative metrics to hunt for similarity that could suggest masquerading, via:

  • Levenshtein similarity between Extension Names,
  • Color Moment similarity between Extension Icons,
  • Cosine Similarity between Extension descriptions,
  • Unsupervised Learning to cluster and visualize extensions in a 3-D scatter plot,
  • Euclidean Distance as a composite similarity score.

Set Similarity Target¶

Enter the CRX identifier to set as the basis for comparison against the rest of the Chrome Web Store. E.g.:

  • LinkedIn Extension: meajfmicibjppdgbjfkpdikfjcflabpk
  • Google Translate: aapbdbdomjkkjkaonfhkkikfgjllcleb
  • Zoom Chrome Extension: kgjfgplpablkjnlkjmjdecgdpfankdle
  • Honey: bmnlcjabgnpnenekpadlanbbkooimhnj
# Target CRX identifier
crx = 'aapbdbdomjkkjkaonfhkkikfgjllcleb'

# Look up the reference row in df (the baseline DataFrame loaded earlier);
# fail early if the identifier is not in the data
matching_rows = df[df['crx'] == crx]
if matching_rows.empty:
    raise ValueError(f"CRX identifier not found: {crx}")
reference_row_index = matching_rows.index[0]  # Index of the first matching row
reference_name = df.loc[reference_row_index, "name"]
reference_desc = df.loc[reference_row_index, "description"]

print(f"Name: {reference_name}")
print(f"Description: {reference_desc}")
Name: Google Translate
Description: View translations easily as you browse the web. By the Google Translate team.

Similarity Enrichment¶

Pre-process the hashes and generate the similarity metrics.

# Calculate Levenshtein similarity between the reference name and every extension name
calculate_levenshtein_distance(reference_name, df)

# Calculate Hamming similarity between the Color Moment hashes of the icons
calculate_cm_hamming_distances(df, reference_row_index)

# Calculate Jaccard similarity between the reference description and every description
calculate_jaccard_similarity(reference_desc, df)

# Calculate cosine similarity of the descriptions and store the scores in the DataFrame
with tqdm(total=len(df), desc="Calculating Cosine Similarity of Descriptions") as pbar:
    df = calculate_cosine_similarity(reference_desc, df)
    pbar.update(len(df))

# Calculate perceptual hash Hamming similarity and find the closest matches
find_closest_matches(df, reference_row_index)
Calculating Levenshtein Similarity from Name: 100%|████████████████████████████████████████| 140446/140446 [00:00<00:00, 822659.02it/s]
Calculating Hamming Similarity of Color Moment Hash: 100%|██████████████████████████████████| 140446/140446 [00:01<00:00, 93871.35it/s]
Calculating Jaccard Similarity of Description: 100%|███████████████████████████████████████| 140446/140446 [00:00<00:00, 268120.20it/s]
Calculating Cosine Similarity of Descriptions: 100%|███████████████████████████████████████| 140446/140446 [00:01<00:00, 111549.81it/s]
Calculating Hamming Similarity of Perceptual Hash: 100%|████████████████████████████████████| 140446/140446 [00:02<00:00, 64254.79it/s]
Measuring Euclidean Distance between Similarity Metrics: 100%|███████████████████████████████| 140446/140446 [01:11<00:00, 1976.21it/s]

Naming Similarity¶

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. In this case, we invert the distance into a similarity score so that its orientation matches our other similarity metrics: higher means more similar. The levenshtein_distance column below therefore holds a similarity, and the reference extension scores 1.0 against itself.
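
As a rough illustration of the metric, here is a minimal pure-Python sketch of Levenshtein distance and one possible way to invert it into a similarity. The function names and the normalization (dividing by the longer string's length) are assumptions made for this example; the notebook's own helper scales its scores differently, so the values will not match the table below.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so each row stays short
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


if __name__ == "__main__":
    reference = "Google Translate"
    for name in ["Google Translate", "Go Translate", "Hey Boy"]:
        print(name, round(levenshtein_similarity(reference, name), 3))
```

Dividing by the longer length keeps the score in the range 0 to 1, which makes it directly comparable to the other similarity metrics.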

df[['name','description', 'crx', 'levenshtein_distance']].head(11)
index | name | description | crx | levenshtein_distance
0 | Google Translate | View translations easily as you browse the web... | aapbdbdomjkkjkaonfhkkikfgjllcleb | 1.000000
1 | Simple Translate | Quickly translate selected or typed text on we... | ibplnjkanclpjokhdolnendpplpjiace | 0.188732
2 | Edge Translate | Translate what you want. | bocbaocobfecmglnmeaeppambideimao | 0.188732
3 | Go Translate | Translation Plug-in from CDAC-GIST | cfmeoigobgkgnepgmpbecadegpcenllg | 0.188732
4 | PokeTranslate | Translates Pokemon-related Japanese words to E... | hhnjbiglgjbjdfookhpjfnlkhpbckoij | 0.154930
5 | GPT Translate | Summarizes web page content in the language of... | ljfjmbdgbebmjbfmdneeimenolagonol | 0.154930
6 | Trance Translate | Trance is a easy minimalist translator | fnhpjnlhllbbpfaapjfcpbbninjigjjo | 0.154930
7 | Pro Translate | Translate selected text on the web page and co... | ggbiakgkfnpekepnjlocbbhmlcbfmfai | 0.154930
8 | Call Google Translate | 使用谷歌翻译(https://github.com/mantou132/GoogleTran... | hjaohjgedndjjaegicnfikppfjbboohf | 0.154930
9 | Cool Translator | Translate words on the page. Type in and trans... | cifbpdjhjkopeekabdgfjgmcbcgloioi | 0.154930
10 | Sports Translate | Chrome Extension for Sports Translate Customers | opjgedcdgkgjbhepddoloeagbcjfdoog | 0.154930

Icon Similarity¶

Icon similarity is assessed using a Color Moment Hash, a compact representation (a hash) of an image based on the statistical moments of its color components. This hash is valuable for image comparison because it encapsulates significant information about the image's color distribution while being relatively insensitive to small changes or distortions in the image. The hashes are compared by Hamming distance and reported as a similarity, so the reference icon scores 1.0 against itself in the cm_hamming_distance column below.
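
To make the comparison step concrete, the sketch below computes a Hamming similarity between two hashes. It assumes the hashes are available as equal-length hexadecimal strings and uses the fraction of matching bits as the score; both the representation and the function name are assumptions for illustration, and the notebook's own scaling of this metric is not shown here.

```python
def hamming_similarity(hash_a: str, hash_b: str) -> float:
    """Fraction of matching bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    if not hash_a:
        raise ValueError("hashes must not be empty")
    total_bits = 4 * len(hash_a)  # each hex digit encodes 4 bits
    # XOR leaves a 1 bit wherever the two hashes disagree
    differing_bits = bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")
    return 1.0 - differing_bits / total_bits


if __name__ == "__main__":
    # Hypothetical 16-bit hashes, used only to exercise the function
    print(hamming_similarity("ff00", "ff00"))
    print(hamming_similarity("ff00", "ff01"))
```

Because the score is a proportion of bits, it is already on a 0 to 1 scale and points in the same direction as the name and description similarities.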

df = pd.read_csv('cm_hamming_distance_output.csv')
df[['name', 'description', 'crx', 'cm_hamming_distance']].head(20)
index | name | description | crx | cm_hamming_distance
0 | Google Translate | View translations easily as you browse the web... | aapbdbdomjkkjkaonfhkkikfgjllcleb | 1.000000
1 | fanar | شاهد الترجمات بسهولة أثناء تصفح الويب. بواسطة ... | jmepjkkakagfokdpijkhdfajnkdncbmn | 1.000000
2 | A Inner Translate | add additional google translate to page! | ngjmejllkjigibdhaidcaeemepnfbmej | 1.000000
3 | 快捷插件 | 快速打开谷歌翻译页面 | okojpfcopjjbgejafdmkeijaniplohpi | 0.930909
4 | fix RTL translate | fix RTL translate From https://bidar.app | gcojlhljcpgbagiboedilgcoalmpjaaj | 0.746667
5 | Twitter Force MK | Forces MK lang instead of BG for twitter web =) | mkoldlpnnhjekhdnbjjmebfbkkmbgoci | 0.691892
6 | Maple NewTab | Enhance your new tab experience with a comfort... | fobmbldflolfooglijmbibmnhoflbjlb | 0.691892
7 | UEF Attendance Check | Điểm danh sinh viên dành cho Trường Đại Học Ki... | lcddkoaaaiagpikmeeoaecijnpogjpfa | 0.640000
8 | MyChat | El chatbot que organiza la documentación de tu... | ckggpbggopidgpbnefogdbnajebobhmb | 0.640000
9 | AnyTranslate | Translate text anywhere | hhcjlckencdgngjkbbpoffncomjajegm | 0.640000
10 | Translator Themes Settings (TTS) | Translator Themes Settings - This is an open s... | fikcdhfopokbnadlkhheplknciabokag | 0.640000
11 | Dark Mode For Google Translate | Toggle between normal light mode and dark mode... | ghobnecdmkccjpaecanmpndfjjimhkmg | 0.640000
12 | Hong Kong Language | Hong Kong Language | nagnoddoploniajnljjinfdabfefjffi | 0.640000
13 | Deck transfer for Yu-Gi-Oh! Master Duel | Import and export Yu-Gi-Oh! decks from Master ... | lgcpomfflpfipndmldmgblhpbnnfidgk | 0.640000
14 | Webfont Previewer | This extension allows you to test webfonts out... | ehmpabgeehikhdodemjoenbonjkdeopn | 0.640000
15 | Snap Video Controller | 指パッチンで動画を再生・停止 | boohhlbipnjcfijdhiaagiomalbfacdh | 0.590769
16 | Tiny Tags: instant query params | Tiny Tags is a Chrome extension that simplifie... | adjhigahlbnjoiaoaoignnhfablfcoba | 0.590769
17 | Translator | Translate words and phrases while browsing the... | pnpdnibdembnnlaiibkeandepjajegoi | 0.590769
18 | 포우 주작기 | 누구나 쉽게 주작을! | dcndpmpkigmkohoajbfjlnaliplgphbk | 0.590769
19 | Hey Boy | Replaces all images on a given page with pictu... | jnkckehcibleladajcnejjbiadbkodng | 0.590769
from html import escape

from IPython.display import HTML, display


def display_icons_with_names_and_hamming(icon_urls, names, hamming_distances):
    """Render the icons in one table row, each labeled with its name and Hamming similarity."""
    # Create a base HTML template
    html_str = '<table><tr>{}</tr></table>'
    image_str = ''

    for name, url, hamming in zip(names, icon_urls, hamming_distances):
        # Escape store-supplied values so they cannot inject markup into the notebook
        image_str += (
            f'<td style="text-align: center;">'
            f'<img src="{escape(url)}" style="max-width: 100px; max-height: 100px;"><br>'
            f'{escape(name)}<br>'
            f'Hamming Similarity: {hamming}'
            f'</td>'
        )

    # Insert the constructed image cells into the HTML template
    display(HTML(html_str.format(image_str)))

# Function call to display icons with names and Hamming similarities
#display_icons_with_names_and_hamming(icon_urls, top_5_extension_names, top_5_extension_hamming)

Description Similarity¶

Cosine similarity is used to compare each extension's description text against the reference description. Each text is represented as a vector, where each dimension corresponds to a word from the combined set of words in both texts, and the value in each dimension corresponds to the weight of that word in the text. The score is the cosine of the angle between the two vectors: 1.0 when the vectors point in the same direction, and 0 when the texts share no words.
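
The idea can be sketched in a few lines of standard-library Python. This example assumes a simple lowercase word tokenizer and raw word counts as the weights; the notebook's helper may use a different weighting (for example TF-IDF), so treat this as an illustration of the calculation, not a reproduction of the scores in the table below.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Assumed tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference = ("View translations easily as you browse the web. "
                 "By the Google Translate team.")
    candidate = "View translations easily as you browse the web."
    print(round(cosine_similarity(reference, candidate), 3))
```

Because the score depends on the direction of the vectors and not their length, a short description copied from a longer one still scores highly, which is useful when hunting for lookalike listings.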

df[['name', 'description', 'cosine_similarity']].head(6)
index | name | description | cosine_similarity
0 | Google Translate | View translations easily as you browse the web. By the Google Translate team. | 1.000000
1 | AG Translate | View translations easily as you browse the web. | 0.800077
2 | Flow Browser plugin | View translations easily as you browse the web. | 0.800077
3 | AG Translate | View translations easily as you browse the web. | 0.800077
4 | SZTAKI Dictionary Extension | Translate easily as you browse the web. | 0.666127
5 | Dictionary and Flashcards | View translations and add flashcards easily as you browse the web. | 0.654372

Aggregate Clustering / Analysis¶

K-means is an unsupervised learning algorithm that partitions a dataset into *K* distinct, non-overlapping clusters based on the attributes of the data points -- in this case, the Levenshtein, Cosine, and Color Moment Hamming similarity measures. The algorithm aims to minimize the within-cluster variances and maximize the between-cluster variances, meaning that it seeks to create clusters where members of the same cluster are as similar as possible while also being as different as possible from members of other clusters.
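
For readers who want to see the mechanics, here is a minimal pure-Python sketch of the standard K-means loop (Lloyd's algorithm) on 3-D points. The function names, the random initialization, and the sample points are all assumptions for illustration; a real workflow would more likely call a library implementation.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100, seed=0):
    """Return (labels, centroids) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical (name, icon, description) similarity vectors
    points = [
        (1.0, 1.0, 1.0), (0.9, 1.0, 0.8), (0.8, 0.9, 1.0),
        (0.1, 0.2, 0.1), (0.2, 0.1, 0.0), (0.0, 0.1, 0.2),
    ]
    labels, centroids = k_means(points, k=2)
    print(labels)
    print(centroids)
```

Since K-means relies on distances, features on very different scales would let one metric dominate the clusters, which is why the workflow scales the feature vectors before clustering.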

Closest Peers by Euclidean Distance¶

[Figure: Euclidean distance (source: https://hlab.stanford.edu/brian/euclidean_distance_in.html)]
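
As a sketch of the composite score, the Euclidean distance between two points p and q in three dimensions is sqrt((p1 - q1)^2 + (p2 - q2)^2 + (p3 - q3)^2). The example below ranks candidates by their distance from the reference extension, which sits at (1.0, 1.0, 1.0) because it scores 1.0 against itself on every similarity metric. The candidate names and values are hypothetical and chosen only to show the ranking.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rank_by_distance(reference, candidates):
    """Sort (name, vector) pairs from closest to farthest from the reference point."""
    return sorted(
        candidates.items(),
        key=lambda item: euclidean_distance(item[1], reference),
    )


if __name__ == "__main__":
    # The reference extension is identical to itself on all three metrics
    reference = (1.0, 1.0, 1.0)
    # Hypothetical (name, icon, description) similarity vectors
    candidates = {
        "ext_a": (0.9, 1.0, 0.8),
        "ext_b": (0.2, 0.6, 0.1),
        "ext_c": (0.5, 0.9, 0.7),
    }
    for name, vector in rank_by_distance(reference, candidates):
        print(name, round(euclidean_distance(vector, reference), 3))
```

A small distance means a candidate is close to the reference on all three metrics at once, which is the pattern a masquerading extension would be expected to show.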

Using our analysis approach, we can quickly narrow down our list of 140,000+ extensions to a handful of candidates for deeper analysis!

Recap¶

Pre-processing:¶

    1. Ingest data into Splunk
    2. Filter / Normalize / Enrich
    3. Push / Pull into Jupyter via DSDL

Exploration:¶

    1. Scale and Orient Feature Vectors
    2. Cluster via K-Means
    3. 3-D Scatter-plot

Analysis:¶

    1. Measure Euclidean Distance

Thank you!¶

https://github.com/splunk/PEAK/tree/main/similarity_comp_via_euclidean_distance¶

https://github.com/fetterm4n/infosec-jupyterthon¶

https://twitter.com/iknowuhack¶
